{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SQL for Exploratory Data Analysis Essential Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1) EDA\n",
    "\n",
    "### Why explore data?\n",
    "- help understand the quality of data\n",
    "- identify missing values\n",
    "- look for unusual or unexpected values\n",
    "- look for inconsistent data, especially with regard to business rules\n",
    "- check distribution of data, its shape\n",
    "- understand subgroups using histograms\n",
    "- check correlations between variables,using the Pearson Correlation Coefficient\n",
    "\n",
    "\n",
    "### Exploring data with statistics\n",
    "- Central Tendency (Mean, Mode)\n",
    "- Spread (Bell curve, long tail curve, etc) or Variance, Std Dev\n",
    "\n",
    "### Testing hypothesis with statistics\n",
    "- Example: customers over the age of 40 make more purchase than customers over age of 30\n",
    "\n",
    "`SELECT Avg(number_purchase), Avg(amount_purchase) FROM customer_summary\n",
    "WHERE age > 40`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "------\n",
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2) Data Quality Checks\n",
    "\n",
    "### Why check data?\n",
    "- Missing values\n",
    "- values outstides of expected range (Ages less than 0 or greater than 120, etc)\n",
    "- malformed values (names and descriptions got truncated, date, etc)\n",
    "- values that violate business rules (order total > customer's credit limit)\n",
    "- inconsistent values across tables, such as customerID\n",
    "\n",
    "------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Types of quality checks\n",
    "\n",
    "### Missing Values Check\n",
    "`SELECT * FROM store_values\n",
    "WHERE units_sold IS NULL`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Out of Range Values\n",
    "`SELECT * FROM employees\n",
    "WHERE (age < 0) AND (age > 40)`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`SELECT * FROM store_sales\n",
    "WHERE employee_shift > 10 OR\n",
    "employee_shift < 0`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inconsistent Data Check\n",
    "- Postal code check\n",
    "- zip code (010-027) are in Massachusetts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Business Logic\n",
    "- check to make sure no enrollees in an insurance plan have enorllment data earlier than date, plan was created\n",
    "- employee start date can't be earlier than company start date, etc\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imputing Missing Values\n",
    "\n",
    "- continue without replacing missing values\n",
    "- define a default value\n",
    "- compute value based on neighboring rows\n",
    "    - group rows by some criteria and average value using nearby rows\n",
    "    - sort rows by some order so that similar rows are next to each other, and then copy the value of missing column from the row before / row after\n",
    "- compute value based on regression\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "### Identifying Business Logic Checks\n",
    "\n",
    "- `Date`: reasonable range or not, compare dates, whether future dates are reasonable or not\n",
    "- `Numeric columns`: range of values are within what would be expected, look for outliers, calculate STDDEV\n",
    "- `Text`: if categorical => should be small number of distinct values, if largely unique => values should be close to the # of rows in data set, check length and look for unsually long or short ones (check for at least 10 upper/lower outliers)\n",
    "\n",
    "#### Standard Deviation Check \n",
    "- calculate mean of a column\n",
    "- calculate std dev \n",
    "- set upper limit (mean + 3 * std-dev)\n",
    "- set lower limit (mean - 3t * std-dev)\n",
    "- any values > or < those limits are outliers\n",
    "\n",
    "-----\n",
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3) Calculating Quartiles\n",
    "\n",
    "## learn about distribution of data \n",
    "- plot the data series and check normal/left/right skewed)\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Check for MInium, Maximum and Median\n",
    "\n",
    "`SELECT MIN(age) FROM employees`\n",
    "\n",
    "`SELECT MIN(age), MIN (start_date), MIN(num_vacation_days)\n",
    "FROM employees`\n",
    "\n",
    "`SELECT MAX(age) FROM employees`\n",
    "\n",
    "##### oracle\n",
    "`SELECT MEDIAN(age) FROM employees`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### postgres\n",
    "![postgres](median.png)\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ordering and Counting\n",
    "\n",
    "`SELECT age FROM employees\n",
    "ORDER BY age`\n",
    "\n",
    "`SELECT age FROM employees\n",
    "ORDER BY age DESC`\n",
    "\n",
    "`SELECT age, start_date FROM employees\n",
    "ORDER BY age, start_date`\n",
    "\n",
    "`SELECT COUNT(age) FROM employees` # return non null values\n",
    "\n",
    "###### Mix Aggregrate\n",
    "`SELECT COUNT(age), COUNT(start_date), MAX(age), MIN(start_date)\n",
    "FROM employees`\n",
    "\n",
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculating quartiles and Box plots\n",
    "\n",
    "- use median for visualizing the data 1st, 2nd, 3rd, 4th quartiles\n",
    "- IRQ= 3rd - 1st \n",
    "\n",
    "-----\n",
    "\n",
    "--------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Histograms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Partitioning Data using SQL to create histograms\n",
    "- we can create bins \n",
    "- for example: Annual Insurances cost by ages (we can assume that 6 and 7 years old may have a smiliar cost , 71 and 72 years old may have the similar too).\n",
    "- we can divide the value by the bin size and `floor` to the whole integer value.\n",
    "\n",
    "------\n",
    "\n",
    "![bucket](bucket.png)\n",
    "\n",
    "----\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "`SELECT FLOOR(age/5) as bucket FROM insured_persons`\n",
    "\n",
    "`SELECT FLOOR(age/5) as bucket, FLOOR(age/5) * 5 AS bucket_floor\n",
    "FROM insured_persons`\n",
    "\n",
    "![newbucket](newbucket.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculating Histograms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`SELECT\n",
    "FLOOR(age/5) * 5 AS bucket_floor,\n",
    "AVG(annual_cost)\n",
    "FROM insured_persons\n",
    "GROUP BY bucket_floor\n",
    "ORDER BY bucket_floor`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-----\n",
    "## Simple Histogram Visualization\n",
    "\n",
    "#### SQL Repeat Function\n",
    "`REPEAT('*', 10): **********`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`CAST (variable AS INTEGER)`\n",
    "\n",
    "`SELECT\n",
    "FLOOR(age/5) * 5 AS bucket_floor, \n",
    "AVG(annual_cost),\n",
    "REPEAT('*', CAST(ROUND(FLOOR(age/5)*5) / 100)) AS INTEGER) AS bar\n",
    "FROM insured_persons\n",
    "GROUP BY bucket_floor\n",
    "ORDER BY bucket_floor`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![histogram](histogram.png)\n",
    "\n",
    "----------\n",
    "----------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5) Checking Correlation between Attributes\n",
    "- scatter plot (positive , negative correlation, no correlation)\n",
    "\n",
    "\n",
    "## Calculating correlation with SQL\n",
    "- -1: negatively correlated\n",
    "- +1: positively correlated\n",
    "- 0: not correlated\n",
    "\n",
    "#### positively correlated\n",
    "`SELECT CORR(units_sold, total_revenue)\n",
    "FROM product_sales`\n",
    "\n",
    "\n",
    "#### negatively correlated\n",
    "`SELECT CORR(rooms_booked, rooms_avaliable)\n",
    "FROM hotel_rooms`\n",
    "\n",
    "#### not correlated\n",
    "`SELECT CORR(distance_to_store, amount_spend)\n",
    "FROM hotel_rooms`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------\n",
    "---------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv-datascience",
   "language": "python",
   "name": "venv-datascience"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
